WIP: Add rhel9 base for vGPU-manager containers #486

mvalsecchi-nv · 2025-11-10T07:50:36Z

Close #453

WIP as I have not tested it yet.

I used this RH article, as it is the source of truth for the RHOCP <> RHCOS matrix.

As per the supported matrix, GPU Operator only supports RHOCP 4.14 or later.

I've added extra cases (4.12, 4.13), since some users might be interested in building drivers for environments that NVIDIA does not support but that can still receive updates from Red Hat through the Extended Update Support Add-On. If that is not necessary, I can remove the lines covering 4.12 and 4.13.

Since the GPU Operator docs instruct users to export OS_TAG as rhcos4.<x>, I believe the Makefile changes should not break any existing automation scripts.

Let me find a lab to test it out, and I will update the PR accordingly.
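
To illustrate the OS_TAG convention mentioned above, here is a minimal sketch of how an `rhcos4.<x>` tag could be mapped to a build subdirectory. It is not the Makefile logic from this PR: the function name `rhel_subdir_for_os_tag` is hypothetical, and the 4.12 cutoff between RHEL 8 and RHEL 9 is an assumption that should be checked against the RH article.

```shell
# Hypothetical helper, not taken from this PR's Makefile.
# Maps an OS_TAG such as "rhcos4.18" to a vgpu-manager subdirectory name.
rhel_subdir_for_os_tag() {
  os_tag="$1"

  # Only tags of the form rhcos4.<minor> are handled in this sketch.
  case "$os_tag" in
    rhcos4.*) minor="${os_tag#rhcos4.}" ;;
    *) echo "unsupported OS_TAG: $os_tag" >&2; return 1 ;;
  esac

  # Reject an empty or non-numeric minor version (for example "rhcos4.x").
  case "$minor" in
    ''|*[!0-9]*) echo "invalid minor version in OS_TAG: $os_tag" >&2; return 1 ;;
  esac

  # Assumption: RHCOS 4.12 and earlier is RHEL 8 based, 4.13 and later is RHEL 9 based.
  if [ "$minor" -le 12 ]; then
    echo "rhel8"
  else
    echo "rhel9"
  fi
}
```

With the cluster tested later in this thread, `rhel_subdir_for_os_tag rhcos4.18` would select the rhel9 folder under this assumption.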

copy-pr-bot · 2025-11-10T07:50:40Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Close NVIDIA#453 Signed-off-by: Michele Valsecchi <[email protected]>

mvalsecchi-nv · 2025-11-12T08:22:45Z

vgpu-manager/rhel9/Dockerfile

@@ -0,0 +1,34 @@
+FROM nvcr.io/nvidia/cuda:13.0.1-base-ubi9


Unfortunately, symlinks (from vgpu-manager/rhel8 to vgpu-manager/rhel9) would not cut it: we pass the subdir, which makes those files (inside rhel8) unreachable from any other sibling folder.

Let me see if I can come up with a cleaner way than duplicating all the files in vgpu-manager/rhel*.

It seems most directories do indeed have copies of the same files, so I'll leave the duplicates inside the vgpu-manager/rhel* folders instead of refactoring.

Signed-off-by: Michele Valsecchi <[email protected]>

mvalsecchi-nv · 2025-11-13T05:38:23Z

I tested commit 9a950fc155929a8235be7cab7022b7cd7882fa7c with OCP 4.18.24, GPU Operator 25.10.0, and driver 580.95.02, and it works as expected.

Environment:

$ oc get csv -n nvidia-gpu-operator
NAME                              DISPLAY               VERSION   REPLACES                         PHASE
gpu-operator-certified.v25.10.0   NVIDIA GPU Operator   25.10.0   gpu-operator-certified.v25.3.4   Succeeded

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.24   True        False         2d20h 

$ ls -lah vgpu-manager/rhel9
[...]
-rwxr-xr-x 1 user user  97M Nov 11 16:54 580.95.02-vgpu-kvm.run

Outcome:

$ oc get pod -n nvidia-gpu-operator
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-55f5686c46-762nh                               1/1     Running   0          17h
nvidia-sandbox-device-plugin-daemonset-v6q7q                1/1     Running   0          16h
nvidia-sandbox-validator-w7pfn                              1/1     Running   0          16h
nvidia-vgpu-device-manager-22zpw                            1/1     Running   0          16h
nvidia-vgpu-manager-daemonset-418.94.202509100653-0-zmkcl   2/2     Running   0          16h

$ oc get clusterpolicy -o yaml | less
  vgpuDeviceManager:
    enabled: true
  vgpuManager:
    enabled: true
    image: vgpu-manager
    imagePullSecrets:
    - podman-registry-credentials
    - private-registry-secret
    repository: <redacted>.svc:5000/openshift
    version: 580.95.02
status:
  conditions:
  - lastTransitionTime: "2025-11-12T13:24:05Z"
    message: ClusterPolicy is ready as all resources have been successfully reconciled <===
    reason: Reconciled 
    status: "True"
    type: Ready <===

Let me test the refactored version and also create a VM to confirm everything works as intended; then I'll remove the WIP from the title.
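
For retesting, the pod listing above could be checked mechanically rather than by eye. The sketch below is only an illustration: `check_pods_ready` is a hypothetical helper that reads the table printed by `oc get pod` on stdin and flags any pod that is not Running with all containers ready. Note that it would also flag pods in the Completed state, which may or may not be what you want.

```shell
# Hypothetical helper: reads `oc get pod` table output on stdin.
# Prints the name of every pod that is not Running or not fully ready,
# and returns non-zero if at least one such pod was found.
check_pods_ready() {
  awk '
    # Skip the header row; expect NAME, READY, STATUS in the first three columns.
    NR > 1 && NF >= 3 {
      split($2, ready, "/")          # READY column looks like "2/2"
      if ($3 != "Running" || ready[1] != ready[2]) {
        print $1
        bad = 1
      }
    }
    END { exit bad }
  '
}
```

A possible invocation is `oc get pod -n nvidia-gpu-operator | check_pods_ready`, which prints nothing and exits with status 0 when every listed pod is Running and fully ready.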

Add rhel9 base for vGPU-manager containers

9a950fc

Close NVIDIA#453 Signed-off-by: Michele Valsecchi <[email protected]>

mvalsecchi-nv force-pushed the issue/453 branch from c5bf10d to 9a950fc on November 12, 2025 08:19

mvalsecchi-nv added 2 commits November 13, 2025 14:25

Add notes

e434389

Signed-off-by: Michele Valsecchi <[email protected]>

Remove duplicate recipes

29c9f51

Signed-off-by: Michele Valsecchi <[email protected]>
